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Abstract. A number of recent analyses of cosmological data have reported hints for the 
presence of extra radiation beyond the standard model expectation. In order to test the 
robustness of these claims under different methods of constructing parameter constraints, 
we perform a Bayesian posterior-based and a likelihood profile-based analysis of current 
data. We confirm the presence of a slight discrepancy between posterior- and profile-based 
constraints, with the marginalised posterior preferring higher values of the effective number 
of neutrino species N c g. This can be traced back to a volume effect occurring during the 
marginalisation process, and we demonstrate that the effect is related to the fact that cosmic 
microwave background (CMB) data constrain 7V e g only indirectly via the redshift of matter- 
radiation equality. Once present CMB data are combined with external information about, 
e.g., the Hubble parameter, the difference between the methods becomes small compared 
to the uncertainty of N c g. We conclude that the preference of precision cosmological data 
for excess radiation is "real" and not an artifact of a specific choice of credible/confidence 
interval construction. 
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1 Introduction 

In the past years, measurements of the temperature and polarisation anisotropics in the 
cosmic microwave background have revealed a wealth of information about the Universe. One 
particular quantity that can be inferred from CMB data is the relativistic energy density p T 
around decoupling, typically expressed in terms of the effective number of massless neutrino 
degrees of freedom iV c fj : 

Pr = — T 7 4 (1 + aN cS ) , (1.1) 

where T 7 = (2.72548 ± 0.00057) K [1] is the CMB temperature and a = § (^-) 4/3 . The three 
standard model neutrino species are expected to contribute N e g = 3.046 effective degrees 
of freedom [2]. Intriguingly however, present cosmological data show some indication for 
iV c ff > 3.046 [3-9], hinting at the possible existence of further light particle species. These 
hints are based on a Bayesian statistics analysis of the data however, and as long as the 
evidence for 7V e g > 3.046 is weak, one might also want to consider an alternative approach 
of constraining N e g. A profile likelihood analysis for instance, being prior- independent and 
parameterisation-invariant, provides a useful cross-check of these results and is complemen- 
tary to the usual Bayesian analysis based on the posterior probability density [10]. Using a 
profile likelihood-based analysis, it was recently claimed in [11] that the hints for iV e fj > 3.046 
are merely artifacts of the Bayesian construction of parameter constraints. We shall revisit 
this claim in the present work. 

This paper is organised as follows: we will describe the details of our analysis in section 2, 
present our results in section 3 and conclude in section 4. 

2 Analysis 
2.1 Data sets 

For clarity of presentation and given the considerable numerical effort required to reliably 
construct the profile likelihood, we will limit ourselves to two different combinations of data: 
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Table 1. Parameters and prior ranges for the cosmological and nuisance parameters. For each 
individual analysis only one out of the first three parameters is used. 



Parameter 


symbol 


Prior 


Hubble parameter 


li 


0.4 -» 


1.0 


Dark energy density 


Ha 


->■ 


1 


Ratio of sound horizon to angular diameter distance at decoupling 


e s 


0.5 -> 


10 


Baryon density 




0.005 


0.1 


Cold dark matter density 




0.01 -> 


0.99 


Amplitude of scalar spectrum @ k = 0.05 Mpc -1 


logllO 10 ^] 


2.7 -> 


4 


Scalar spectral index 


n s 


0.5 -> 


1.5 


Redshift of reionisation 


z rc 


1 -> 


50 


Effective number of massless neutrinos 


AW 


1.5 -> 


10 


Amplitude of Sunyaev-Zel'dovich contribution 


A sz 


-> 


3 


Amplitude of clustered point source contribution 


A c 


-> 


20 


Amplitude of Poisson point source contribution 


A P 


-> 


100 



1. A CMB only set, consisting of the 7- year Wilkinson Microwave Anisotropy Probe data 
(WMAP7) [12] plus the 2008 Atacama Cosmology Telescope (ACT) data [13]. For this 
data set, the discrepancy between the Bayesian result of [13] and the profile likelihood 
result reported in [11] is particularly large. 

2. The same data combined with a constraint on the Hubble parameter (HST) derived by 
Riess et al. [14]. 

2.2 Model and priors 

We consider a one-parameter extension of the 6-parameter ACDM vanilla model, varying 
also the effective number of massless degrees of freedom N c g on top of the standard parame- 
ters. Additionally, three parameters describing the foreground contribution to the small-scale 
CMB temperature spectrum are required. The parameterisation of the vanilla model is not 
unique, and there are a number of different choices commonly used in the literature. Since 
the parameterisation implicitly determines the prior probability distribution, these choices 
can affect the inference of parameters, even though the physical models are equivalent. In 
this work we explicitly compare three parameterisation choices: a flat prior on the Hubble 
parameter Hq, a flat prior on the dark energy density Oa and a flat prior on the ratio of 
sound horizon to angular diameter distance at decoupling 9 S . We list all free parameters and 
their associated prior ranges in table 1. The primordial Helium fraction is fixed to Y p = 0.24 
in order to facilitate comparison with other authors' results. 

2.3 Marginalised posterior and profile likelihood 

Given a model with n free parameters, the full information of the data is contained in 
the n-dimensional likelihood function C. If one wants to construct constraints on a single 
parameter (p, the dimensionality obviously needs to be reduced. Most commonly, this is done 
in a Bayesian framework, by first promoting £ to a probability density function (through 
multiplication with a prior probability density), and then integrating ("marginalising") over 
the nuisance directions (see [10] for a more detailed discussion), resulting in the marginalised 
posterior. The marginalised posterior can easily be extracted from Markov chains and has 
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a straightforward interpretation as the probability density of the true value of tp, given the 
model, data and priors. 

Since the choice of priors (or equivalently, the choice of parameter basis [15]) may be 
somewhat subjective, one might also want to consider a prior-independent construction, such 
as the profile likelihood CP . Here, instead of integrating over the nuisance directions, one 
takes the maximum value of C for a fixed value of ip. Though the profile likelihood does 
not have a formal probabilistic interpretation, it is often used to construct approximate 
frequentist confidence intervals based on the likelihood ratio, by identifying the region for 
which AXg ff = — 2 ln(£ p (99 max ) — C p (p)) < 1 with the 68% confidence interval. We note that 
this interval may not have the desired frequentist coverage properties if the profile likelihood 
is not Gaussian [16]. 

2.4 Construction of the profile likelihood 

We construct the marginalised posteriors from Markov chains generated with a modified 
version of the public Markov chain Monte Carlo sampler CosmoMC [17], using a conservative 
Gelman- Rubin convergence criterion [18] of R— 1 < 0.01, and making sure that the numerical 
precision settings are sufficient for the data sets considered. 

Naively, one might think that one could use the same chains to construct the profile 
likelihood, by binning the data in N e g and identifying the best-fitting point in each bin. 
Unfortunately, this method does not turn out to be suitable for the case at hand. The reason 
is that the standard Metropolis-Hastings algorithm samples the region near the maximum 
of the posterior very poorly. In Appendix A we present a rough analytical estimate of the 
probability of finding at least one sample of a Markov chain within a given A%g ff of the 
best-fit. As shown in the bottom panel of figure 4, for our 10-parameter model one would 
need of order 10 5 independent samples to even have a 50% chance of the best-fitting sample 
to lie within 0.5 of the true best-fit xts- This should be compared to the typically few times 
10 correlated samples one usually has in Markov chains used for parameter estimation. The 
problem is exacerbated by the binning: in particular the estimate of CP for the bins in the 
tails of the marginalised posterior would be extremely inaccurate. 

We therefore employ a different, numerically somewhat more demanding, construction 
that avoids under-sampling of the tails and is immune to biases introduced by a binning 
procedure. On a grid of fixed values of N c s, we estimate the respective maxima of the 
likelihood by generating Markov chains at temperatures T < 1, with the temperature and 
length of chains chosen such that ln£ p is estimated with an accuracy of at least 0.1. In 
addition, we determine the global best-fit by letting N e s vary as well. 

3 Results 

In figure 1 we show the results for WMAP7+ACT data. Firstly, we note that the posteriors 
differ very little for the different priors, indicating a remarkable robustness of the results 
to the choice of prior. Secondly, the profile Ax^ ff clearly deviates from the parabolic shape 
one would expect for a Gaussian profile likelihood, showing an obvious skew towards the 
large- N E g side, so we refrain from mapping it to frequentist confidence limits. Thirdly, CP is 
markedly shifted (by up to about two thirds of a standard deviation) towards lower values of 
N e s compared to the marginalised posteriors. A similar tendency was also observed in [10], 
and, more recently, in [11] - however, their results for the same data set (both mode and 
likelihood ratio-based bounds) differ considerably from ours, possibly due to them attempting 
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Figure 1. Constraints on N e g from WMAP7+ACT data. Thin black lines denote the posterior 
probability density marginalised over the other parameter directions for three different choices of 
prior (solid: Hq, dashed: 6 S , dotted: The profile likelihood is plotted in thick red lines, both in 

terms of £ p /£^ iax (solid) and A\lg (dotted). 

to construct the profile likelihood from Markov chains that were originally generated for the 
purpose of Bayesian parameter inference. For instance, the individual best-fit estimates of 
the eight T = 1 WMAP+ACT Markov chains (each containing about 3 x 10 4 samples) we 
generated for constructing the marginalised posterior display a considerable spread, with a 
standard deviation of 0.57 - indicating the unreliability of this method. 

Is there an explanation for why larger values of N c s have a high posterior probability 
despite apparently not fitting the data too well (and vice versa for smaller iV e ff)? In [11], it 
was claimed that the effect, and, by association, also any possible hints for a deviation of 
iVeff from the standard model expectation, is "driven by prior effects" . This is a very generic 
statement however; it should be clear that any Bayesian credible intervals are always to some 
extent prior-dependent. We would like to propose a slightly more specific explanation here, 
namely that the shift of the marginalised posterior towards larger A^g-enhancement is caused 
by a volume effect in the marginalisation process. 

Let us, for a moment, imagine the full posterior were Gaussian. In that case, marginal- 
isation and profiling would lead to the same result. Also, for a Gaussian posterior, the 
variance of the other parameters' marginalised posteriors on slices of constant N e g would not 
depend on N e g. If, however, these variances did depend on N e s, and happened to be posi- 
tively correlated with N e g, then at larger (smaller) N c g there would be more (less) volume 
in the nuisance directions, and the marginalised posterior would be enhanced (suppressed) 
compared to the profile. We shall see that this is indeed the case here, and there is in fact a 
simple physical argument for why it should be so. 

As discussed in [7, 19], N c fi impacts the CMB power spectra in several ways; most 
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Figure 2. Variance of the marginalised posterior probability of oj m on slices of width A7V c ff = 1 as 
a function of Ngg. The crosses mark the values extracted from the Markov chains, the red line is 
the prediction based on the variance of z cq , consisting of a constant term induced by the bin width 
(equation (3.3), dashed line) and the intrinsic variance of cj m (equation (3.2), dotted line). 



importantly through the redshift of matter-radiation equality 

l + ^cq= = — (3.1) 

which determines the magnitude of the early integrated Sachs- Wolfe effect. It is actually 
z C q (not iV c fj or the matter density co m ) that is directly constrained by the CMB [20], and 
hence essentially uncorrelated with uo m and N e ff. Ignoring the tiny uncertainty in the photon 
energy density Uj, the variance of w m for fixed N e g is given by 

Yar(u m )\ NcB ~ Var(z eq ) (w 7 (l + aiV eff )) 2 , (3.2) 

and thus the posterior becomes wider in the cj m -direction for larger N e g. Since u> m has degen- 
eracies with other parameters, such as Hq, the widening is propagated to those directions as 
well, amplifying the total volume effect. In figure 2 we show the N e g -dependence of the pos- 
terior's width: using our original Markov chains, we evaluate the variance of the marginalised 
posterior of w m on slices of width SN e g = 1. This is compared to the expectation from the 
measurement of z eq = 3180 ± 129, which can easily be calculated from the same chains. The 
variance on these slices is composed of two components, the intrinsic one of equation (3.2), 
and a constant piece due to the bin width, given by 

Var b (w m ) = — ^z c 2 q a 2 5N% S . (3.3) 

Their sum is found to be in excellent agreement with the variances from the chains. 

The constraints on N e g can be improved by adding non-CMB data to break some of 
the parameter degeneracies, and most of the recent hints for N c g > 3.046 are based on such 



- 5 - 



combinations of data. As an example, we add the HST-constraint on Hq here, which breaks 
the N e ff-Ho degeneracy. Our results for WMAP7+ACT+HST data are shown in figure 3. 
The profile likelihood is closer to Gaussian now, and the magnitude of the volume effect 
has become much smaller - C p is shifted by roughly 0.2 with respect to the marginalised 
posteriors. If we compare this to the posterior standard deviation of ~ 0.7, we see that the 
volume effect by itself cannot account for the observed deviation from the standard model 
expectation. We summarise our results in table 2. 




Figure 3. Same as figure 1, for WMAP7+ACT+HST. 



Table 2. Summary of constraints on jV e ff from different analysis methods. For the marginalised 
posterior we list the mean (-/V G ff), mode P max , standard deviation <7Ar off , and the minimal 68%- and 
95%-credible intervals [10]. For the profile likelihood, we list the mode £^ ax and the intervals in 
which Axg ff < 1 and 4, respectively. 



Analysis 


WMAP7+ACT 




WMAP7- 


-ACT+HST 




Bayesian 


(N eB ) 


^raax 


0"JV cff 


68% MCI 


95% MCI 


<AW} 




aN cSS 


68% MCI 


95% MCI 


f/o-prior 


5.78 


5.68 


1.45 


4.18->7.12 


3.03^8.76 


4.37 


4.30 


0.72 


3.61^5.03 


2.96-^5.80 


# s -prior 


5.69 


5.20 


1.44 


4.02-^6.92 


3.01^8.59 


4.37 


4.28 


0.75 


3.57^5.05 


2.89^5.86 


f^A-prior 


5.67 


5.20 


1.46 


4.05^6.98 


2.90->8.65 


4.39 


4.28 


0.74 


3.60^5.08 


2.98^5.89 


Profile 




£ p 

-'-'max 




AXeff < 1 


Axcff < 4 




£ p 

-'-'max 




AXefl < 1 


Axcff < 4 


£ p 




4.73 




3.29-^6.14 


2.12^8.09 




4.07 




3.43^4.76 


2.79^5.50 
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4 Discussion 



We have demonstrated that constraints on the effective number of neutrino species, inferred 
from CMB data, can be subject to a slight discrepancy between the Bayesian marginalised 
posterior and the profile likelihood. This can be attributed to a volume effect primarily in 
the uj m direction, caused by the fact that the CMB data are directly sensitive mostly to the 
redshift of equality, not iV e g itself. 

Before we come to an interpretation, let us illuminate the statistical aspect of this 
result. Regarded from a sampling theory perspective, the mode of the full multi-dimensional 
posterior can be regarded as an unbiased estimator of the true parameter values (since in the 
present problem it coincides by construction with the maximum of the likelihood). In the 
process of marginalisation, this property is lost - the most probable value of N e s does not 
provide the best possible fit to the data, or, in other words, the mode of V(N e g) becomes a 
biased estimator of N e Q (see also [21] for a discussion, or [22] for another applied example). 
The profile likelihood on the other hand retains the unbiasedness of the mode estimator, but, 
unlike the marginalised posterior, it is not sensitive to volume effects, and thus does not have 
a formal statistical interpretation. 

In general it should not come as a surprise that, whenever the full posterior/likelihood's 
dimensionality is reduced, loss of information will be incurred. Marginalisation and profiling 
simply preserve different properties of their related multi-dimensional objects, and can thus 
be a good diagnostic of unusual features. A discrepancy between the two would point to 
a deviation from Gaussianity, and, from a Bayesian perspective, could for instance indicate 
that a certain amount of fine-tuning relative to the prior expectation is required in order to 
optimise the fit to the data. 1 

Finally, to evaluate the relevance of this effect, the magnitude of the bias should be set 
in relation to the intrinsic width of the marginal distribution. For WMAP+ACT data, the 
difference between profile and posterior is of order two thirds of a standard deviation, thus 
not contributing the dominant - but certainly a non-negligible - part to the indication for 
a non-standard N c q. With the addition of HST data, however, the bias is reduced it to less 
than one third of a standard deviation, and a similar trend is to be expected if one added, 
for instance, large scale structure data, or improved measurements of the CMB damping tail 
- be it existing ones from the South Pole Telescope [5], or upcoming ones from Planck. 

We conclude that the recent indication for a deviation of N e g from its standard model 
expectation cannot be accounted for by this statistical effect alone (though the presence of 
an additional statistical bias introduced, e.g., by the modelling of foregrounds, remains a 
possibility). 
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1 We remark that one could, in principle, choose the priors such that the discrepancy would vanish (e.g., 
here, a flat prior on z cq instead of N e g might be a good guess). But with A^ft arguably being a more 
fundamental quantity than z cq , it is doubtful whether such a choice could be reasonably justified from a 
theoretical point of view. 
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A Profiling with Markov chains 



In this section we present an estimate of how well the maximum of a probability distribution 
can be determined by using a Markov chain of length N. 

Let V be a probability distribution on an n-dimensional parameter space ^3, and ip 6 *}3 
be a point in this parameter space. We shall make two simplifying assumptions at this point: 
first, V(<p>) can be approximated by an n-variate Gaussian distribution, and second, all the 
samples in the chain are independent. Without loss of generality one can then take V(<p) to 
have unit variance and be centered around (^ m ax — 

0. Define 



AXeffM = -2(lnP(</w) - lnPfa)), 
and the volume fraction f x of V for which Ax^{<p) < x, 



(A.l) 



fx 



d<pV(<p), 



(A.2) 



with the volume V x implicitly given by the condition ip £ V x 4=> Xegi^f) < x. If one expresses 
(p in spherical coordinates, it can easily be shown that 



fx(n) 



%Jx 



dr 



1 



2vr) 



n/2 



exp 



n/2 



r(f) 



(A.3) 



where T is the Gamma function. If, instead of sampling from V, one generates the Markov 
chain with a temperature parameter T by sampling from V 1 ^ , equation (A.3) can be gen- 
eralised to 



fx(n,T) 



dr 



y-n/2 



n/2 







^exp 


2 



l/T 



„n-l 



(27T) 



n/2 



(A.4) 



If all A samples of the chain are independent, then the probability p that none of the points 
lie within f x is given by 



p(x,n,T,N) = (l-f x (n,T)) 



N 



(A.5) 



It follows triviality that the probability of at least one point of the chain being within 
AXeff = x of Xeff(v 9 max) is p(x,n,T, N) = 1 — p(x,n,T, N). For a few selected slices in 
(n, T, A)-space, p(x, n, T, N) is plotted in figure 4. 
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Figure 4. Probability of finding at least one sample within A%^ ff = x of the true maximum of 
the n-dimcnsional Gaussian posterior V, if the Markov chain was generated at a temperature T and 
contains N independent samples. Top left: dependence on N and n for T = 1 and a; = 0.2. Top right: 
dependence on N and T for n = 10 and x = 0.2. Bottom: dependence on AT and x for T = 1 and 

71 = 10. 



